SelfAskRefusalScorer honors partial_content on blocked pieces by tejas0077 · Pull Request #2083 · microsoft/PyRIT

tejas0077 · 2026-06-25T19:36:22Z

Fixes #2044 (sub-issue #2)

SelfAskRefusalScorer unconditionally returned refusal=True when response_error == "blocked", even when partial_content was available in prompt_metadata. This silently dropped potentially successful jailbreaks from red-team results, and the most evasive successes were exactly the ones being missed.

The fix sets score_blocked_content = True on SelfAskRefusalScorer so the base Scorer class handles partial content substitution via the existing _apply_blocked_content_substitution mechanism. When a blocked piece has partial_content, it is now scored via the LLM instead of being unconditionally treated as a clean refusal.

The rationale string for blocked responses with no partial content has also been updated to be more descriptive.
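For illustration, the behavior change can be modeled with a minimal, self-contained sketch. Everything here is a hypothetical stand-in (`Piece`, `RefusalScore`, `score_refusal`, `keyword_judge`, and the rationale strings); it is not PyRIT's Scorer code and it does not call `_apply_blocked_content_substitution`. It only assumes that a blocked piece may carry `partial_content` in its metadata and that a flag decides whether that content reaches the judge.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Piece:
    """Hypothetical stand-in for a response piece (not PyRIT's class)."""
    converted_value: str
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


@dataclass
class RefusalScore:
    refusal: bool
    rationale: str


def score_refusal(piece: Piece, judge: Callable[[str], bool],
                  score_blocked_content: bool = False) -> RefusalScore:
    """Return a refusal verdict; `judge` stands in for the LLM call."""
    text = piece.converted_value
    if piece.response_error == "blocked":
        partial = piece.prompt_metadata.get("partial_content")
        if not (score_blocked_content and partial):
            # Placeholder rationale, not the string used in the PR.
            return RefusalScore(True, "Response was blocked; treated as a refusal.")
        text = partial  # score the partial output instead of the blocked value
    return RefusalScore(judge(text), "Verdict from the judge.")


def keyword_judge(text: str) -> bool:
    """Toy judge: treats text containing "cannot" as a refusal."""
    return "cannot" in text.lower()


blocked = Piece("", response_error="blocked",
                prompt_metadata={"partial_content": "Step 1: gather the materials"})
print(score_refusal(blocked, keyword_judge).refusal)                              # True
print(score_refusal(blocked, keyword_judge, score_blocked_content=True).refusal)  # False
```

With the flag off, the blocked piece is a refusal regardless of its partial content; with the flag on, the partial content is what gets judged.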

Tests and Documentation

Updated the existing test_score_async_filtered_response test to match the new rationale string, and added a new test, test_score_async_blocked_with_partial_content_scores_partial, which verifies that blocked pieces with partial content are forwarded to the LLM scorer instead of immediately returning refusal=True.
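A rough sketch of what the two checks verify, written against hypothetical stand-ins rather than the PR's actual test code: `RecordingJudge` and `ToyRefusalScorer` are invented for illustration, and the toy scorer defaults `score_blocked_content` to True only to mirror the behavior proposed here. The point is the assertion style: a recording fake shows whether the judge was consulted at all.

```python
import asyncio


class RecordingJudge:
    """Fake LLM judge that records every text it is asked to classify."""

    def __init__(self, verdict: bool):
        self.verdict = verdict
        self.calls = []

    async def is_refusal(self, text: str) -> bool:
        self.calls.append(text)
        return self.verdict


class ToyRefusalScorer:
    """Hypothetical stand-in, not PyRIT's SelfAskRefusalScorer."""

    def __init__(self, judge: RecordingJudge, score_blocked_content: bool = True):
        self.judge = judge
        self.score_blocked_content = score_blocked_content

    async def score_async(self, response_error: str, text: str, metadata: dict) -> bool:
        if response_error == "blocked":
            partial = metadata.get("partial_content")
            if not (self.score_blocked_content and partial):
                return True  # refusal; the judge is never consulted
            text = partial
        return await self.judge.is_refusal(text)


async def check_blocked_without_partial_is_refusal() -> None:
    judge = RecordingJudge(verdict=False)
    scorer = ToyRefusalScorer(judge)
    assert await scorer.score_async("blocked", "", {}) is True
    assert judge.calls == []


async def check_blocked_with_partial_reaches_judge() -> None:
    judge = RecordingJudge(verdict=False)
    scorer = ToyRefusalScorer(judge)
    result = await scorer.score_async("blocked", "", {"partial_content": "Step 1: ..."})
    assert result is False
    assert judge.calls == ["Step 1: ..."]


asyncio.run(check_blocked_without_partial_is_refusal())
asyncio.run(check_blocked_with_partial_reaches_judge())
```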

romanlutz

Thanks for the contribution. I think this could be done either way (default on or default off). I'll defer to @jsong468 since he looked into this in more detail.

jsong468

Hi! In my head, if the model or application filtered out/blocked a response (even if it was already being generated), this should be considered a refusal, which is why the default for score_blocked_content is False. If a user did want to score partial content, they could set scorer.score_blocked_content = True to enable that behavior. Let me know if that makes sense or if you think otherwise :)
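As a small illustration of the opt-in described above, assuming the flag is a plain attribute with a class-level default (an assumption; `ToyScorer` is a hypothetical stand-in, not PyRIT's class), overriding it on one instance leaves the default untouched for every other scorer:

```python
class ToyScorer:
    # Class-level default: blocked responses count as refusals.
    score_blocked_content = False


default_scorer = ToyScorer()
opted_in = ToyScorer()
opted_in.score_blocked_content = True  # per-instance override only

print(default_scorer.score_blocked_content)  # False
print(opted_in.score_blocked_content)        # True
print(ToyScorer.score_blocked_content)       # False
```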

romanlutz · 2026-06-27T06:19:29Z

> Hi! In my head, if the model or application filtered out/blocked a response (even if it was already being generated), this should be considered a refusal, which is why the default for score_blocked_content is False. If a user did want to score partial content, they could set scorer.score_blocked_content = True to enable that behavior. Let me know if that makes sense or if you think otherwise :)

That's fair. In that case, this PR is actively doing the opposite of what I'd expect the default to be and we can close it. Feel free to comment @tejas0077 if you have other thoughts.

fix: SelfAskRefusalScorer honors partial_content on blocked pieces

8f19cb6

tejas0077 mentioned this pull request Jun 25, 2026

Scorers conflate couldn't-score / errored / blocked / hedged with attack-did-not-succeed, under-reporting jailbreaks #2044

Closed

Merge branch 'main' into fix/refusal-scorer-partial-content

88db053

romanlutz reviewed Jun 26, 2026

jsong468 reviewed Jun 26, 2026

romanlutz closed this Jun 27, 2026

tejas0077 wants to merge 2 commits into microsoft:main from tejas0077:fix/refusal-scorer-partial-content
